Introduction

If you recall back in the very first lecture, I introduced the concept of tidy data where each row in a dataset is an observation and each column is a variable. Tidy data is typically longer than it is wider.

For example, you could represent data on minimum wage by province and year like this.

Province 2019 2020 2021 2022
Alberta 14 15 16 17
Ontario 14.75 15 16.25 17
Quebec 14.25 15.5 16 17.5

This is a very common way to present and read data and it works great when you are presenting data in tabular form because we read from left to right.

However, to visualize complex forms of data, a tidy, or longer, structure is useful and necessary for using ggplot2.

The same dataset above looks like this when tidied.

Province Year Wage
Alberta 2019 14
Alberta 2020 15
Alberta 2021 16
Alberta 2022 17
Ontario 2019 14.75
Ontario 2020 15
Ontario 2021 16.25
Ontario 2022 17
Quebec 2019 14.25
Quebec 2020 15.5
Quebec 2021 16
Quebec 2022 17.5

We have already done some conceptual work early in the class distinguishing between observations and variables. This class is focussed more on the tactics to reshape data in different forms in R.

Data Import

We’re going to be working with some data on unemployment rates by year from Statistics Canada. The data are contained in = a csv file in the data subfolder which is pretty straightforward to read. But there are three problems:

  1. There is a bunch of junk above the actual numbers we don’t want.
  2. there is an annoying row with Percent stuck below the headers
  3. There is a bunch of junk below the actual numbers we don’t want.

All of these problems can be dealth with using the skip and the n_max functions.

Because we can see easily that the columns in this case run from 2017 to 2021, maybe we can skip importing the headers and that percentage row and we’ll just make our own names. That will solvce problem 1 and 2. And if we look down at the end, maybe we want to stop at the Unclassified row. It looks like there are just missing data anyway. So let’s just import rows 15 to 42.

Note, we will need the following libraries for the code to follow. Run this code to install these packages if they are not already installed.

install.packages("tidyverse")
install.packages("haven")
install.packages("here")
install.packages("knitr")

Once installed, they can be loaded with this code.

library(haven)
library(tidyverse)
library(here)
library(knitr)
unemployment<-read.csv(here("data/1410002301-eng.csv"), skip=14, nrows=42-15, header=F)

Let’s take a look:

View(unemployment)

So that worked well. Note that in the column there are some junky numbers at the end of the name of each sector. This would require a manipulation which we can talk about next class. For now we will leave it.

Let’s take another look with glimpse()

glimpse(unemployment)
## Rows: 27
## Columns: 8
## $ V1 <chr> "Total, all industries 5", "Goods-producing sector 6", "Agriculture…
## $ V2 <dbl> 6.4, 5.6, 4.4, 8.0, 11.2, 27.3, 5.6, 2.1, 7.7, 3.7, 3.4, 4.1, 3.5, …
## $ V3 <dbl> 5.9, 5.0, 4.0, 6.9, 10.9, 22.8, 4.8, 1.6, 6.6, 3.6, 3.3, 4.1, 3.3, …
## $ V4 <dbl> 5.7, 5.1, 4.4, 7.4, 11.3, 28.2, 4.9, 1.7, 6.8, 3.7, 3.4, 4.1, 3.3, …
## $ V5 <dbl> 9.5, 8.0, 5.1, 10.9, 10.6, 22.5, 9.9, 1.9, 9.7, 7.1, 7.4, 6.6, 7.1,…
## $ V6 <dbl> 7.5, 5.1, 4.7, 7.2, 9.1, 25.4, 5.2, 1.6, 5.8, 4.5, 4.4, 4.7, 3.9, 4…
## $ V7 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ V8 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

The variable names here are not very helpful and there are two columns filled with missing values. So we need to assign some names to these columns and we need to drop the last two columns which somehow got imported even though they are not necessary.

If you remember from last class, when we use the select() command we can also rename the variables we are selecting. So why don’t we just select the first five columns, renaming each column?

unemployment %>% 
  select(Sector=1, `2017`=2, `2018`=3, `2019`=4, `2020`=5)->unemployment
unemployment
##                                                      Sector 2017 2018 2019 2020
## 1                                   Total, all industries 5  6.4  5.9  5.7  9.5
## 2                                  Goods-producing sector 6  5.6  5.0  5.1  8.0
## 3                                             Agriculture 7  4.4  4.0  4.4  5.1
## 4     Forestry, fishing, mining, quarrying, oil and gas 8 9  8.0  6.9  7.4 10.9
## 5  Forestry and logging and support activities for forestry 11.2 10.9 11.3 10.6
## 6                             Fishing, hunting and trapping 27.3 22.8 28.2 22.5
## 7           Mining, quarrying, and oil and gas extraction 8  5.6  4.8  4.9  9.9
## 8                                                 Utilities  2.1  1.6  1.7  1.9
## 9                                              Construction  7.7  6.6  6.8  9.7
## 10                                            Manufacturing  3.7  3.6  3.7  7.1
## 11                                                 Durables  3.4  3.3  3.4  7.4
## 12                                             Non-durables  4.1  4.1  4.1  6.6
## 13                             Services-producing sector 10  3.5  3.3  3.3  7.1
## 14                               Wholesale and retail trade  3.8  3.8  3.7  7.6
## 15                                          Wholesale trade  3.4  2.7  2.5  5.7
## 16                                             Retail trade  4.0  4.1  4.1  8.2
## 17                           Transportation and warehousing  3.1  3.4  3.2  7.5
## 18      Finance, insurance, real estate, rental and leasing  2.3  1.9  1.6  3.3
## 19                                    Finance and insurance  2.3  1.8  1.6  2.6
## 20                       Real estate and rental and leasing  2.1  2.1  1.8  5.0
## 21          Professional, scientific and technical services  2.8  2.6  2.7  5.0
## 22         Business, building and other support services 11  7.0  5.7  6.2 10.0
## 23                                     Educational services  3.5  3.5  3.7  6.7
## 24                        Health care and social assistance  1.7  1.7  1.5  3.5
## 25                      Information, culture and recreation  5.5  5.0  5.6 13.1
## 26                          Accommodation and food services  5.8  5.4  5.6 17.9
## 27            Other services (except public administration)  3.8  2.9  2.8  7.9

Pivot Wide Data To Long

This is classically wide data and it is great for presenting information for tables, primarily because, in English, we read information from left to right. But for other purposes, it is not great. Another way to think about this is is that 2017 is not really a variable. Year is the variable and in the time we observe the unemployment rate, the variable of year takes on another value. So those years should be in one column and the unemployment rate should be in a separate column.

So we need to pivot the columns from 2017 to 2020 to make two new columns: a Year column and an Unemployment Rate column. In order to do this we need to specify which columns we want to pivot down.

unemployment %>% 
  pivot_longer(., cols=`2017`:`2020`)
## # A tibble: 108 × 3
##    Sector                   name  value
##    <chr>                    <chr> <dbl>
##  1 Total, all industries 5  2017    6.4
##  2 Total, all industries 5  2018    5.9
##  3 Total, all industries 5  2019    5.7
##  4 Total, all industries 5  2020    9.5
##  5 Goods-producing sector 6 2017    5.6
##  6 Goods-producing sector 6 2018    5  
##  7 Goods-producing sector 6 2019    5.1
##  8 Goods-producing sector 6 2020    8  
##  9 Agriculture 7            2017    4.4
## 10 Agriculture 7            2018    4  
## # … with 98 more rows

If you remember back in the selection on select() there are a bunch of helpers we can use to pick and choose the columns we want to pivot. Sometimes they are out of order; sometimes the names are long and messy; Sometimes they have a common prefix. So, here, we can also select by position. We want the 2 through 5th columns.

unemployment %>% 
  pivot_longer(., cols=2:5)
## # A tibble: 108 × 3
##    Sector                   name  value
##    <chr>                    <chr> <dbl>
##  1 Total, all industries 5  2017    6.4
##  2 Total, all industries 5  2018    5.9
##  3 Total, all industries 5  2019    5.7
##  4 Total, all industries 5  2020    9.5
##  5 Goods-producing sector 6 2017    5.6
##  6 Goods-producing sector 6 2018    5  
##  7 Goods-producing sector 6 2019    5.1
##  8 Goods-producing sector 6 2020    8  
##  9 Agriculture 7            2017    4.4
## 10 Agriculture 7            2018    4  
## # … with 98 more rows

Notice that the two new columns don’t have really helpful names. By default, pivot_longer() names the two new columns variable and value. But we can change these to suit our purposes. Here we use the names_to and the values_to argument. So one column will contain the names of the pivoted columns and one column will contain the values.

unemployment %>% 
  pivot_longer(., cols=`2017`:`2020`, names_to=c("Year"), values_to=c("Unemployment Rate"))->unemployment
head(unemployment)
## # A tibble: 6 × 3
##   Sector                   Year  `Unemployment Rate`
##   <chr>                    <chr>               <dbl>
## 1 Total, all industries 5  2017                  6.4
## 2 Total, all industries 5  2018                  5.9
## 3 Total, all industries 5  2019                  5.7
## 4 Total, all industries 5  2020                  9.5
## 5 Goods-producing sector 6 2017                  5.6
## 6 Goods-producing sector 6 2018                  5

Notice we have here a perfectly tidy data frame ready for graphing. The x can be the year, the y can be the rate and the color of the line could be the sector.

unemployment %>% 
  ggplot(., aes(x=Year, y=`Unemployment Rate`, col=Sector, group=Sector))+geom_line()

Obviously we would want to make it more readable, but that will come in the next few classes.

Example 2

Sometimes, even when data is properly tidy in that every variable is in a column, we still want to pivot it longer in order to facilitate visualazation. The reason is the entire philosophy of ggplot2 is to link any graphical element that varies with information to one column.

An example is this dataset that contains the immunization coverage rates for DTP and MMR in Toronto schools here.

The data are stored in the file: immunication-coverage-2017-2018.csv in the data subfolder.

immunization<-read.csv(file=here("data/immunization-coverage-2017-2018.csv"))
glimpse(immunization)
## Rows: 808
## Columns: 9
## $ X_id                             <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12…
## $ School.Name                      <chr> "A Y JACKSON SECONDARY SCHOOL", "ACAD…
## $ Enrolled.population              <int> 1070, 110, 247, 66, 1253, 109, 180, 1…
## $ DTP.coverage.rate....            <dbl> 91.0, 88.2, 91.9, 71.2, 90.7, 98.2, 9…
## $ DTP.Religious.exemption.rate.... <dbl> 1.0, 1.8, 3.6, 19.7, 1.0, 0.0, 0.0, 0…
## $ MMR.coverage.rate....            <dbl> 95.7, 90.0, 94.3, 72.7, 97.8, 98.2, 9…
## $ MMR.Religious.exemption.rate.... <dbl> 1.0, 1.8, 3.6, 19.7, 1.0, 0.0, 0.0, 0…
## $ Lat                              <dbl> 43.80526, 43.76242, 43.68315, 43.7454…
## $ Lng                              <dbl> -79.36656, -79.17977, -79.28849, -79.…

Let’s clean the names of the data-set.

library(janitor)
immunization<-clean_names(immunization)

Notice here we have four different columns: the DTP coverage rate, the DTP Religious Exemption Rate, the MMR coverage rate and the MMR Religious Exemption rate.

How many variables ar ethere actually in these columns?

In reality, I would say there are three variables:

  1. The disease (DTP or MMR)
  2. The metric (the rate of coverage or the rate of exemptions)
  3. The rate itself

It might make sense to compare MMR coverage with MMR exemptions for each school and then DTP Coverage with DTP Exemptions for each school. Or it might make sense to compare the coverages of both with the religious exemptions of both. We can actually do a lot of combinations here.

Turning this into a proper tidy data frame is a bit more tricky, but we can do it.

The first thing to do is to take the four columns that have the measurements into one column. Let’s call the names_to column Measure and the values_to column Value. Remember, we can select variables in multiple ways.

We can select by position i.e. the fourth through the seventh column (I got that literally just by counting the columns in a View or a head() call.). We’ll do that, store the object and then also just show other ways to do it without saving the results.

immunization %>% 
pivot_longer(., cols=4:7, names_to=c("Measure"), values_to=c("Value")) ->immunization_long

We could do it by specifying the names of the variables to serve as the beginning and the ending:

immunization %>% 
pivot_longer(., cols=dtp_coverage_rate:mmr_religious_exemption_rate, names_to=c("Measure"), values_to=c("Value")) 
## # A tibble: 3,232 × 7
##     x_id school_name                    enrolled_pop…¹   lat   lng Measure Value
##    <int> <chr>                                   <int> <dbl> <dbl> <chr>   <dbl>
##  1     1 A Y JACKSON SECONDARY SCHOOL             1070  43.8 -79.4 dtp_co…  91  
##  2     1 A Y JACKSON SECONDARY SCHOOL             1070  43.8 -79.4 dtp_re…   1  
##  3     1 A Y JACKSON SECONDARY SCHOOL             1070  43.8 -79.4 mmr_co…  95.7
##  4     1 A Y JACKSON SECONDARY SCHOOL             1070  43.8 -79.4 mmr_re…   1  
##  5     2 ACADEMIE ALEXANDRE-DUMAS                  110  43.8 -79.2 dtp_co…  88.2
##  6     2 ACADEMIE ALEXANDRE-DUMAS                  110  43.8 -79.2 dtp_re…   1.8
##  7     2 ACADEMIE ALEXANDRE-DUMAS                  110  43.8 -79.2 mmr_co…  90  
##  8     2 ACADEMIE ALEXANDRE-DUMAS                  110  43.8 -79.2 mmr_re…   1.8
##  9     3 ADAM BECK JUNIOR PUBLIC SCHOOL            247  43.7 -79.3 dtp_co…  91.9
## 10     3 ADAM BECK JUNIOR PUBLIC SCHOOL            247  43.7 -79.3 dtp_re…   3.6
## # … with 3,222 more rows, and abbreviated variable name ¹​enrolled_population

We could also get really logical and pick out keywords that mark off our selections. In this case, we could say, if the variables contain dtp or mmr, then pivot them down.

immunization %>% 
pivot_longer(., cols=contains("mmr")|contains("dpt"), names_to=c("Measure"), values_to=c("Value")) 
## # A tibble: 1,616 × 9
##     x_id school_name           enrol…¹ dtp_c…² dtp_r…³   lat   lng Measure Value
##    <int> <chr>                   <int>   <dbl>   <dbl> <dbl> <dbl> <chr>   <dbl>
##  1     1 A Y JACKSON SECONDAR…    1070    91       1    43.8 -79.4 mmr_co…  95.7
##  2     1 A Y JACKSON SECONDAR…    1070    91       1    43.8 -79.4 mmr_re…   1  
##  3     2 ACADEMIE ALEXANDRE-D…     110    88.2     1.8  43.8 -79.2 mmr_co…  90  
##  4     2 ACADEMIE ALEXANDRE-D…     110    88.2     1.8  43.8 -79.2 mmr_re…   1.8
##  5     3 ADAM BECK JUNIOR PUB…     247    91.9     3.6  43.7 -79.3 mmr_co…  94.3
##  6     3 ADAM BECK JUNIOR PUB…     247    91.9     3.6  43.7 -79.3 mmr_re…   3.6
##  7     4 AFRICENTRIC ALTERNAT…      66    71.2    19.7  43.7 -79.5 mmr_co…  72.7
##  8     4 AFRICENTRIC ALTERNAT…      66    71.2    19.7  43.7 -79.5 mmr_re…  19.7
##  9     5 AGINCOURT COLLEGIATE…    1253    90.7     1    43.8 -79.3 mmr_co…  97.8
## 10     5 AGINCOURT COLLEGIATE…    1253    90.7     1    43.8 -79.3 mmr_re…   1  
## # … with 1,606 more rows, and abbreviated variable names ¹​enrolled_population,
## #   ²​dtp_coverage_rate, ³​dtp_religious_exemption_rate

Now let’s take a look at the reshaped data.

View(immunization_long)
A long version of the immunization dataset

A long version of the immunization dataset

That looks good, but notice that in our new Measure variable there are actually two variables tucked in there! Disease and Coverage versus Exemption! It would be useful to separate our Measure variable into two: one column for disease (e.g. dtp versus mmr) and the other column that indicates whether the rate in the value column might be the coverage rate or the exemption rate.

dplyr offers a function called separate() that does just that. It’s basically a version of Microsoft Excel’s text to columns feature.

?separate

The basic syntax is separate(data, col="columntoseparate", sep="symbol_to_separate_at", into=c("name_of_new_column" , "name_of_new_column2"))

If we look at the variable Measure we see that we could use the _ to separate out the bits of information. Let’s try this:

immunization_long %>% 
  separate(., col="Measure", sep="_", into=c("Disease", "Measure"))->immunization_long

We do get some warnings, but if we look at it, it basically looks pretty good.

View(immunization_long)
The results of separate.

The results of separate.

That looks quite good. In the real world, I would want to manipulate this to capitalize the terms and make them look nice (as we will practice next week). But for now we will just carry on.

Now, the beauty of this is that we can group the diseases and measurements any way we want.

immunization_long %>% 
  ggplot(., aes(x=Value, y=fct_reorder(school_name, Value), fill=Disease))+geom_col(position="dodge")+facet_grid(~Measure)

Obviously this is completely unreadable: but we can do a few things to show what we can do. One is we can stop printing the school names, and turn this into an interactive plot where a reader only receives the schoool names when hovering the click over a bar.

We need the plotly library.

#install.packages('plotly')
library(plotly)

We make our graph, exactly as above: and save it in plot1. Then run ggplotly on plot1

immunization_long %>% 
  ggplot(., aes(x=Value, y=fct_reorder2(school_name, Measure, Value, .desc=F), fill=Disease))+geom_col(position="dodge")+facet_grid(~Measure)+theme(axis.text.y=element_blank())->plot1
ggplotly(plot1)

This is also not really sufficient, but what I want to show you is how we could also compare the coverage and the exemptions by putting the disease, side-by-side.

immunization_long %>% 
  ggplot(., aes(x=Value, y=fct_reorder2(school_name, Disease, Value, .desc=F), fill=Measure))+geom_col(position="dodge")+facet_grid(~Disease)+theme(axis.text.y=element_blank())->plot2
ggplotly(plot2)

Long To Wide Data

Sometimes however, we need to go the other way. Statistics Canada often provides data on both raw numbers and percentages and it looks like this.

unemployment_long<-read.csv(file=here("data/unemployment_long.csv.csv"))
# Clean the names
unemployment_long<-clean_names(unemployment_long)

This is a data-set of unemployment rates and employment levels (e..g the number of people employed in a sector) for 2017 to 2021 in Canada. On the one hand, the value column helpfully includes one

If we look at it and scroll down to the bottom you will see the problem here.

View(unemployment_long)
A long dataset of unemployment rates in Canada.

A long dataset of unemployment rates in Canada.

We need to turn this long data into wide data with one column for unemployment and one column for employment.

names(unemployment_long)
##  [1] "ref_date"                                           
##  [2] "geo"                                                
##  [3] "dguid"                                              
##  [4] "labour_force_characteristics"                       
##  [5] "north_american_industry_classification_system_naics"
##  [6] "sex"                                                
##  [7] "age_group"                                          
##  [8] "uom"                                                
##  [9] "uom_id"                                             
## [10] "scalar_factor"                                      
## [11] "scalar_id"                                          
## [12] "vector"                                             
## [13] "coordinate"                                         
## [14] "value"                                              
## [15] "status"                                             
## [16] "symbol"                                             
## [17] "terminated"                                         
## [18] "decimals"

We need to specify that we want to take the names of the new column from the labour_force_characteristics variable and the values of the new columns have to come from the value column.

unemployment_long %>% 
  pivot_wider(., names_from=c("labour_force_characteristics"), values_from=c("value"))->unemployment_wide
View(unemployment_wide)
Problem with the wide dataset

Problem with the wide dataset

The problem is that the variable labour_force_characteristics and the variable uom (Unit of Measurement) are basically duplicate values. labour_force_characteristics has the values Employment which measures the number of people employed and the Unemployment Rate which measures the …unemployment rate. But the other variable uom has exactly the same information.

unemployment_long %>% 
  select(labour_force_characteristics, uom) %>% 
  slice_sample(n=20)
##    labour_force_characteristics        uom
## 1             Unemployment rate Percentage
## 2             Unemployment rate Percentage
## 3             Unemployment rate Percentage
## 4             Unemployment rate Percentage
## 5                    Employment    Persons
## 6                    Employment    Persons
## 7                    Employment    Persons
## 8                    Employment    Persons
## 9                    Employment    Persons
## 10                   Employment    Persons
## 11                   Employment    Persons
## 12                   Employment    Persons
## 13                   Employment    Persons
## 14            Unemployment rate Percentage
## 15            Unemployment rate Percentage
## 16            Unemployment rate Percentage
## 17                   Employment    Persons
## 18                   Employment    Persons
## 19                   Employment    Persons
## 20                   Employment    Persons

So, we end up with a variable called Employment and a variable called Unemployment Rate but there is also a variable called UOM and it has entries for both the new variables.

The quick and dirty fix for this is to just select the bare minimum of variables you need. In this way, we’ll avoid duplication.

unemployment_long %>% 
  select(ref_date, labour_force_characteristics, north_american_industry_classification_system_naics, age_group, value) %>% 
  pivot_wider(., names_from=c("labour_force_characteristics"), values_from=c("value"))->unemployment_wide

Now take a look.

View(unemployment_wide)
A wide dataset

A wide dataset

now, we could use either Employment or Unemployment Rate as a y variable depending on what we wanted.

unemployment_wide %>% 
  ggplot(., aes(x=ref_date, y=Employment, col=north_american_industry_classification_system_naics))+geom_line()

Or we could pick the unemployment rate.

unemployment_wide %>% 
  ggplot(., aes(x=ref_date, y=`Unemployment rate`, col=north_american_industry_classification_system_naics))+geom_line()